Add a new dataset Mercury #238
Conversation
@SivilTaram FYI
Hi, thank you very much for submitting this interesting benchmark! LGTM, just two comments:
- Did you make sure the current implementation matches the scores reported in your paper for one of the public LLMs?
- Can you add some documentation on how to use the benchmark to the docs (https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs)?
@loubnabnl Thank you so much for reviewing this code :)
Yes. The scores reported in our paper are based on this implementation. We are also working on publishing a public leaderboard page.
Sure. The instructions have been added. See this commit.
Thanks! Ready to merge 🚀
Add a new dataset Mercury
Motivation first:
TL;DR: Mercury is a dataset for evaluating the computational efficiency of Python code generation.
Amidst the recent strides in evaluating Large Language Models for Code (Code-LLMs), existing benchmarks have mainly focused on functional correctness, overlooking the importance of computational efficiency.
To fill this gap, we present Mercury, the first computational efficiency benchmark for Code-LLMs. It comprises 1,889 Python tasks, each with sufficient solutions to support a runtime distribution. Based on this distribution, we introduce a new metric, Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and computational efficiency simultaneously.
On Mercury, leading Code-LLMs achieve 65% on Pass but less than 50% on Beyond. Given that an ideal Beyond score would be aligned with the Pass score, this indicates that while Code-LLMs exhibit impressive capabilities in generating functionally correct code, there remains a notable gap in their computational efficiency. Finally, our empirical experiments reveal that Direct Preference Optimization (DPO) serves as a robust baseline for enhancing computational efficiency compared with Supervised Fine-Tuning (SFT), which paves a promising avenue for future exploration of efficient code generation. Our code and data are available on GitHub: https://github.com/Elfsong/Mercury
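For intuition, here is a minimal, illustrative sketch of a runtime-percentile-weighted score for a single generation. It is a simplification rather than the exact Beyond formula from the paper or the task implementation, and the function name and arguments are hypothetical:

```python
def beyond_score(passed: bool, runtime: float, historical_runtimes: list[float]) -> float:
    """Illustrative runtime-percentile-weighted score for one generated solution.

    A failing generation scores 0; a passing one is credited with the fraction
    of the task's historical runtime distribution it is at least as fast as.
    (The exact Beyond definition is given in the paper and the task code.)
    """
    if not passed or not historical_runtimes:
        return 0.0
    slower_or_equal = sum(1 for t in historical_runtimes if t >= runtime)
    return slower_or_equal / len(historical_runtimes)


# Example: a correct solution that is faster than 4 of the 5 historical solutions.
print(beyond_score(True, 0.12, [0.10, 0.15, 0.20, 0.25, 0.30]))  # 0.8
```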
Write a full paragraph describing the feature;
In this work, we introduce Mercury, a novel code generation benchmark designed to assess and improve the computational efficiency of Code-LLMs. It comprises 1,889 Python programming tasks stratified into three difficulty levels and divided into two splits for model evaluation and fine-tuning, respectively. For each evaluation task, we provide a test-case generator to remedy the shortfall in test-case coverage. In measuring computational efficiency, the primary challenge stems from normalizing absolute runtimes across tasks with diverse runtime ranges. We therefore collect and locally execute numerous historical solutions for each task to form a runtime distribution, and use the runtime percentile of LLM-generated code on that distribution, rather than its absolute runtime, to evaluate computational efficiency. Furthermore, to mitigate performance discrepancies caused by irrelevant processes and diverse hardware configurations, we set up an isolated sandbox environment for task execution when establishing the local runtime distributions. More details can be found in the paper: https://arxiv.org/abs/2402.07844
Provide a code snippet that demonstrates its future use;
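Below is a minimal sketch of how the benchmark could be run with the evaluation harness, mirroring the usage documented in the docs added in this PR. The model name, the task name `mercury`, and the flag values are illustrative placeholders; please refer to the docs for the exact command.

```bash
# Hypothetical example run; model, task name, and flag values are placeholders.
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mercury \
  --n_samples 5 \
  --temperature 0.2 \
  --batch_size 5 \
  --allow_code_execution
```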
In case this is related to a paper, please attach a link;
More details can be found in the paper: https://arxiv.org/abs/2402.07844